refactor: remove devicearray code to reduce complexity (#600)

cpcloud merged 6 commits into NVIDIA:main
Conversation
/ok to test
gmarkall left a comment:

The code changes look good.
I need to spend a few minutes investigating a bit, because I am surprised that the "fast" path seemed to be slower than the default one. But, things have changed a lot since I initially implemented it, so it's also plausible we've drifted into a place where it no longer helps.
I will post benchmarks in a bit, but the difference isn't huge. I wouldn't have been surprised if it were the other way around (which was my initial interpretation until I realized it wasn't).
Force-pushed c9b0501 to 6628ea7
/ok to test

/ok to test
Force-pushed 6628ea7 to f320ce6
Benchmarks are a little variable (here the best improvement is around 7%).
Greptile Overview

Greptile Summary: This PR successfully removes the C++ extension-based `DeviceArray` class. The implementation is correct and safe. Confidence Score: 5/5

Important Files Changed: [file analysis table elided]
My understanding is that the opposite is shown - the "fast path" being removed here is actually 0-4% slower than the "fallback", so removing this code is both a performance and complexity improvement.
I've been experimenting with this locally, in two configurations: [configurations elided]

The idea being that we're comparing [elided]. With this branch, I get: [benchmark results elided]

So it looks like performance is mostly the same, except in [one case, elided]. I'm looking into why this could be.
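Launch-overhead comparisons like the one above are usually done as microbenchmarks. A minimal `timeit` sketch of the idea, with a pure-Python `launch` placeholder standing in for an actual kernel launch (this is not the repo's pytest benchmark suite):

```python
import timeit

def launch(arg):
    # Placeholder for kernel[grid, block](arg): only the per-call
    # Python-side dispatch overhead is being measured, not GPU work.
    return arg

def per_call_seconds(repeats=5, calls=10_000):
    # Take the minimum over several repeats, as microbenchmarks
    # usually do, to reduce scheduling noise.
    timer = timeit.Timer(lambda: launch(0))
    return min(timer.repeat(repeat=repeats, number=calls)) / calls

overhead = per_call_seconds()
```

Differences of a few percent, as reported in this thread, sit well within the noise floor of such measurements unless many repeats are taken.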
Also, it seems that this branch is faster in all cases with the [configuration elided].
Force-pushed d598a52 to aa2a2bf
Force-pushed aa2a2bf to 2fd2e58
Force-pushed 2fd2e58 to 3bca842
Force-pushed 868376c to b89d7fd
/ok to test
Force-pushed b89d7fd to 8a6657d

/ok to test
Additional Comments (2)

- numba_cuda/numba/cuda/tests/benchmarks/test_kernel_launch.py, line 42: syntax: IDs are swapped; `cuda.jit` is dispatch mode, `cuda.jit("void(float32[::1])")` is signature mode
- numba_cuda/numba/cuda/tests/benchmarks/test_kernel_launch.py, line 96: syntax: IDs are swapped here too; `cuda.jit` is dispatch mode, `cuda.jit("void(...)")` is signature mode
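For context, the distinction the review draws can be sketched in plain Python with a hypothetical `jit` decorator (a stand-in, not the real `cuda.jit`): dispatch mode compiles lazily once per argument-type tuple, while signature mode compiles eagerly for the declared signature.

```python
# Hypothetical stand-in for cuda.jit, illustrating the two modes only.
compilations = []  # records (function name, argument types) per compile

def _compile(func, argtypes):
    compilations.append((func.__name__, argtypes))
    return func  # a real JIT would return specialized device code

def jit(signature=None):
    def decorate(func):
        if signature is None:
            # Dispatch mode: compile lazily, once per argument-type tuple.
            cache = {}
            def wrapper(*args):
                key = tuple(type(a) for a in args)
                if key not in cache:
                    cache[key] = _compile(func, key)
                return cache[key](*args)
            return wrapper
        # Signature mode: compile eagerly for the declared signature.
        return _compile(func, signature)
    if callable(signature):
        # Bare @jit usage: dispatch mode.
        func, signature = signature, None
        return decorate(func)
    return decorate

@jit
def add(a, b):                      # dispatch mode: not compiled yet
    return a + b

@jit("void(float32[::1])")
def scale(a):                       # signature mode: compiled now
    return a
```

Swapping the benchmark IDs, as the review suggests, matters because the two modes pay their compilation cost at different times, so mislabeling them misattributes where the launch overhead comes from.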
8 files reviewed, 2 comments

/ok to test
Force-pushed e4101fe to 5d4c45b
Force-pushed 5d4c45b to edfdf00

/ok to test
v0.23.0

- Capture global device arrays in kernels and device functions (NVIDIA#666)
- Fix NVIDIA#624: Accept Numba IR nodes in all places Numba-CUDA IR nodes are expected (NVIDIA#643)
- Fix Issue NVIDIA#588: separate compilation of NVVM IR modules when generating debuginfo (NVIDIA#591)
- feat: allow printing nested tuples (NVIDIA#667)
- build(deps): bump actions/setup-python from 5.6.0 to 6.1.0 (NVIDIA#655)
- build(deps): bump actions/upload-artifact from 4 to 5 (NVIDIA#652)
- Test RAPIDS 25.12 (NVIDIA#661)
- Do not manually set DUMP_ASSEMBLY in `nvjitlink` tests (NVIDIA#662)
- feat: add print support for int64 tuples (NVIDIA#663)
- Only run dependabot monthly and open fewer PRs (NVIDIA#658)
- test: fix bogus `self` argument to `Context` (NVIDIA#656)
- Fix false negative NRT link decision when NRT was previously toggled on (NVIDIA#650)
- Add support for dependabot (NVIDIA#647)
- refactor: cull dead linker objects (NVIDIA#649)
- Migrate numba-cuda driver to use cuda.core.launch API (NVIDIA#609)
- feat: add set_shared_memory_carveout (NVIDIA#629)
- chore: bump version in pixi.toml (NVIDIA#641)
- refactor: remove devicearray code to reduce complexity (NVIDIA#600)



Overview
This PR removes the C++ extension-based `DeviceArray` class and replaces it with a pure Python implementation, significantly reducing codebase complexity while maintaining functionality. The changes also introduce type computation caching to mitigate potential performance overhead.

Changes

Implementation Changes

numba_cuda/numba/cuda/cudadrv/devicearray.py

- `DeviceNDArrayBase` now inherits from `object` instead of `_devicearray.DeviceArray`
- `_numba_type_instance` attribute caching in the `_numba_type_` property to avoid repeated type computation

Performance Considerations
Removed Optimization: The C++ implementation provided fast-path type fingerprinting through direct table lookup for device arrays during dispatch: `[ndim-1][layout][dtype]`.

Mitigation: Instance-level caching via the `_numba_type_` property reduces repeated type computation overhead: the type is computed once per `DeviceNDArrayBase` instance.

Expected Impact: [details elided; the remaining path is the `StridedMemoryView`-based one]

Trade-offs

Benefits: [list elided]

Costs: [list elided]
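To make both mechanisms concrete, here is a pure-Python sketch of the removed `[ndim-1][layout][dtype]` fast path and the per-instance caching that replaces it. All names and table contents are illustrative, not the real C++ extension's:

```python
# Illustrative sketch only: the real fast path lived in the C++
# extension and produced Numba type objects, not strings.

LAYOUTS = ("C", "F", "A")
DTYPES = ("float32", "float64", "int32", "int64")
MAX_NDIM = 3

def make_type(ndim, layout, dtype):
    # Stand-in for constructing a Numba array type.
    return f"array({dtype}, {ndim}d, {layout})"

# Removed fast path: precompute types for common combinations so
# dispatch becomes three index operations instead of a construction.
TABLE = [
    {lay: {dt: make_type(ndim, lay, dt) for dt in DTYPES} for lay in LAYOUTS}
    for ndim in range(1, MAX_NDIM + 1)
]

def fast_fingerprint(ndim, layout, dtype):
    if 1 <= ndim <= MAX_NDIM:
        try:
            return TABLE[ndim - 1][layout][dtype]
        except KeyError:
            pass
    # Slow path: construct the type directly.
    return make_type(ndim, layout, dtype)

# Replacement mitigation: compute the type once per array instance.
class DeviceNDArrayBase:
    def __init__(self, ndim, layout, dtype):
        self.ndim, self.layout, self.dtype = ndim, layout, dtype
        self._numba_type_instance = None  # filled on first access

    @property
    def _numba_type_(self):
        if self._numba_type_instance is None:
            self._numba_type_instance = make_type(
                self.ndim, self.layout, self.dtype
            )
        return self._numba_type_instance

arr = DeviceNDArrayBase(2, "C", "float32")
cached = arr._numba_type_    # computed once, reused on later accesses
```

Since a device array's type never changes after construction, the per-instance cache recovers most of the table's benefit while keeping the code in Python, which is consistent with the near-identical benchmark numbers reported in this thread.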